Root Mean Squared Logarithmic Error (RMSLE) — Regression Metric (From Scratch)#
RMSLE measures error in log space: it is the RMSE between \(\log(1 + y)\) and \(\log(1 + \hat y)\).
It is most useful when targets are non-negative, span orders of magnitude, and you care about multiplicative / percentage-like errors.
Goals
- Build intuition with numeric examples + Plotly visuals
- Write RMSLE/MSLE in clear notation (including domain constraints)
- Implement `root_mean_squared_log_error` in NumPy (from scratch) and validate against scikit-learn
- Show how RMSLE naturally leads to optimizing a model on a `log1p`-transformed target
- Summarize pros/cons, good use cases, and common pitfalls
Quick import#
from sklearn.metrics import root_mean_squared_log_error
Equivalent: `np.sqrt(mean_squared_log_error(...))`.
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.metrics import (
mean_squared_error,
mean_squared_log_error,
root_mean_squared_error,
root_mean_squared_log_error,
)
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
Prerequisites#
- Regression setup: true targets \(y\) and predictions \(\hat y\)
- Logarithms and the `log1p`/`expm1` trick:
  - `log1p(y) = log(1 + y)` is stable when \(y\) is near 0
  - `expm1(z) = exp(z) - 1` is the inverse of `log1p`
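A quick numerical sketch of why `log1p`/`expm1` matter near zero (the tiny value below is chosen only so that `1.0 + y` rounds to exactly `1.0` in float64):

```python
import numpy as np

y = 1e-17  # so small that 1.0 + y rounds to exactly 1.0 in float64

naive = np.log(1.0 + y)       # catastrophic: evaluates log(1.0) = 0.0
stable = np.log1p(y)          # accurate: ~1e-17
roundtrip = np.expm1(stable)  # expm1 inverts log1p, recovering y
```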
1) Definition and notation#
Given \(n\) samples with non-negative targets \(y_i \ge 0\) and predictions \(\hat y_i \ge 0\), define the log-transformed values:
\[t_i = \log(1 + y_i), \qquad \hat t_i = \log(1 + \hat y_i).\]
The mean squared logarithmic error (MSLE) is:
\[\mathrm{MSLE}(y, \hat y) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat t_i - t_i \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + \hat y_i) - \log(1 + y_i) \right)^2.\]
The root mean squared logarithmic error (RMSLE) is:
\[\mathrm{RMSLE}(y, \hat y) = \sqrt{\mathrm{MSLE}(y, \hat y)}.\]
Weighted variant with sample weights \(w_i \ge 0\):
\[\mathrm{MSLE}_w(y, \hat y) = \frac{\sum_{i=1}^{n} w_i \left( \hat t_i - t_i \right)^2}{\sum_{i=1}^{n} w_i}.\]
Key identity (what makes this metric convenient):
\[\mathrm{RMSLE}(y, \hat y) = \mathrm{RMSE}\big(\log(1 + y),\ \log(1 + \hat y)\big).\]
Notes:
- `log` is the natural logarithm; using another base just scales the metric by a constant.
- For multi-output regression, implementations typically compute RMSLE per output and then average.
2) Domain constraints and edge cases#
Non-negativity: Most definitions (and scikit-learn) require \(y \ge 0\) and \(\hat y \ge 0\).
Zeros are fine: `log1p(0) = 0`, which is why \(\log(1 + y)\) is used instead of \(\log(y)\).
Negative predictions: a linear model can output negative values; for RMSLE you often either
use a model that enforces \(\hat y \ge 0\) (e.g., predict in log space), or
clip: \(\hat y \leftarrow \max(\hat y, 0)\) at evaluation time (common in practice).
Near zero, it behaves like squared error: for small \(y\), \(\log(1+y) \approx y\).
For large values, it behaves like squared relative error: for large \(y\), \(\log(1+y) \approx \log(y)\).
vals = np.array([0.0, 0.1, 1.0, 10.0, 100.0])
pd.DataFrame(
{
"y": vals,
"log1p(y)": np.log1p(vals),
"expm1(log1p(y))": np.expm1(np.log1p(vals)),
}
)
| | y | log1p(y) | expm1(log1p(y)) |
|---|---|---|---|
| 0 | 0.0 | 0.000000 | 0.0 |
| 1 | 0.1 | 0.095310 | 0.1 |
| 2 | 1.0 | 0.693147 | 1.0 |
| 3 | 10.0 | 2.397895 | 10.0 |
| 4 | 100.0 | 4.615121 | 100.0 |
3) Intuition: RMSLE cares about ratios (mostly)#
For large targets, the +1 becomes negligible and:
\[\log(1 + \hat y_i) - \log(1 + y_i) \approx \log \hat y_i - \log y_i = \log \frac{\hat y_i}{y_i}.\]
So for large \(y\), a prediction that is off by a factor of \(c\) (i.e., \(\hat y = c\,y\)) has error approximately:
\[\left( \log c \right)^2.\]
This means:
Overpredicting by \(\times 2\) and underpredicting by \(\div 2\) have the same penalty (because \(\log(2)\) and \(\log(1/2) = -\log(2)\) square to the same value).
The metric is much less dominated by very large targets than RMSE/MSE.
For small targets, \(\log(1+y) \approx y\), so the metric behaves closer to squared error on the original scale.
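A small numeric check of the over/under symmetry for a large target (illustrative values only):

```python
import numpy as np

def sq_log_error(y_t, y_p):
    return (np.log1p(y_p) - np.log1p(y_t)) ** 2

y = 1000.0
over = sq_log_error(y, 2.0 * y)   # predict double the target
under = sq_log_error(y, 0.5 * y)  # predict half the target
approx = np.log(2.0) ** 2         # large-y approximation: (log c)^2
```

For `y = 1000` the three values agree to within about 1%, confirming both the symmetry and the large-y approximation.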
ratios = np.logspace(-2, 2, 500) # 0.01 .. 100
y_trues = [0.1, 1.0, 10.0, 100.0, 1000.0]
parts = []
for y in y_trues:
y_pred = ratios * y
parts.append(
pd.DataFrame(
{
"ratio": ratios,
"sq_log_error": (np.log1p(y_pred) - np.log1p(y)) ** 2,
"series": f"y_true={y:g}",
}
)
)
# large-y approximation: (log ratio)^2
parts.append(
pd.DataFrame(
{
"ratio": ratios,
"sq_log_error": (np.log(ratios)) ** 2,
"series": "(log ratio)^2 (large-y approx)",
}
)
)
df_ratio = pd.concat(parts, ignore_index=True)
fig = px.line(
df_ratio,
x="ratio",
y="sq_log_error",
color="series",
log_x=True,
title="Per-sample squared log error vs multiplicative ratio",
labels={
"ratio": "ratio = y_pred / y_true",
"sq_log_error": "(log1p(y_pred) - log1p(y_true))^2",
"series": "curve",
},
)
fig.add_vline(x=1.0, line_dash="dash", line_color="black")
fig.show()
4) A tiny worked example#
We’ll compute RMSLE step-by-step and compare against scikit-learn.
y_true = np.array([0.0, 1.0, 10.0, 100.0])
y_pred = np.array([0.0, 2.0, 8.0, 120.0])
t_true = np.log1p(y_true)
t_pred = np.log1p(y_pred)
diff = t_pred - t_true
msle = float(np.mean(diff**2))
rmsle = float(np.sqrt(msle))
print("t_true:", t_true)
print("t_pred:", t_pred)
print("diff:", diff)
print("MSLE:", msle)
print("RMSLE:", rmsle)
print("sklearn MSLE:", mean_squared_log_error(y_true, y_pred))
print("sklearn RMSLE:", root_mean_squared_log_error(y_true, y_pred))
t_true: [0. 0.6931 2.3979 4.6151]
t_pred: [0. 1.0986 2.1972 4.7958]
diff: [ 0. 0.4055 -0.2007 0.1807]
MSLE: 0.05932808530023383
RMSLE: 0.24357357266385413
sklearn MSLE: 0.05932808530023383
sklearn RMSLE: 0.24357357266385413
df_example = pd.DataFrame(
{
"i": np.arange(len(y_true)),
"y_true": y_true,
"y_pred": y_pred,
"log1p(y_true)": t_true,
"log1p(y_pred)": t_pred,
"sq_log_error": diff**2,
}
)
fig = px.bar(
df_example,
x="i",
y="sq_log_error",
hover_data=["y_true", "y_pred", "log1p(y_true)", "log1p(y_pred)"],
title="Per-sample MSLE contribution (squared log error)",
labels={"i": "sample index", "sq_log_error": "(log1p(y_pred) - log1p(y_true))^2"},
)
fig.show()
5) RMSLE vs RMSE: what changes when you take logs?#
Consider targets that span orders of magnitude.
With RMSE, a fixed relative error (say +20%) produces much larger absolute residuals for large targets, so large targets dominate the metric.
With RMSLE, a fixed relative error produces approximately the same log residual, so the contributions are more balanced.
y_true_scale = np.array([1.0, 10.0, 100.0, 1000.0])
# Scenario A: same relative error (20% over)
y_pred_rel = 1.2 * y_true_scale
# Scenario B: same absolute error (+10)
y_pred_abs = y_true_scale + 10.0
def sq_error(y_t, y_p):
return (y_p - y_t) ** 2
def sq_log_error(y_t, y_p):
return (np.log1p(y_p) - np.log1p(y_t)) ** 2
df_scale = pd.concat(
[
pd.DataFrame(
{
"scenario": "20% over",
"y_true": y_true_scale,
"y_pred": y_pred_rel,
"squared error": sq_error(y_true_scale, y_pred_rel),
"squared log error": sq_log_error(y_true_scale, y_pred_rel),
}
),
pd.DataFrame(
{
"scenario": "+10 absolute",
"y_true": y_true_scale,
"y_pred": y_pred_abs,
"squared error": sq_error(y_true_scale, y_pred_abs),
"squared log error": sq_log_error(y_true_scale, y_pred_abs),
}
),
],
ignore_index=True,
)
df_long = df_scale.melt(
id_vars=["scenario", "y_true", "y_pred"],
value_vars=["squared error", "squared log error"],
var_name="term",
value_name="contribution",
)
fig = px.bar(
df_long,
x="y_true",
y="contribution",
color="term",
barmode="group",
facet_col="scenario",
log_y=True,
title="Per-sample contributions: RMSE/MSE vs RMSLE/MSLE",
labels={"y_true": "target (y_true)", "contribution": "contribution (log scale)"},
)
fig.show()
for name, yp in [("20% over", y_pred_rel), ("+10 absolute", y_pred_abs)]:
rmse = root_mean_squared_error(y_true_scale, yp)
rmsle = root_mean_squared_log_error(y_true_scale, yp)
print(f"{name:>11} | RMSE={rmse:.4f} | RMSLE={rmsle:.4f}")
20% over | RMSE=100.5038 | RMSLE=0.1603
+10 absolute | RMSE=10.0000 | RMSLE=0.9536
6) NumPy implementation (from scratch)#
We’ll implement MSLE and RMSLE with scikit-learn-like handling:
- 1D and 2D targets (`(n_samples,)` or `(n_samples, n_outputs)`)
- Optional `sample_weight`
- `multioutput` ∈ {`"raw_values"`, `"uniform_average"`} or explicit output weights
def _as_2d(y):
y = np.asarray(y, dtype=float)
if y.ndim == 1:
return y.reshape(-1, 1)
if y.ndim == 2:
return y
raise ValueError("y must be 1D or 2D (n_samples,) or (n_samples, n_outputs).")
def _check_non_negative(y, *, name):
if np.any(y < 0):
raise ValueError(f"{name} contains negative values; RMSLE/MSLE require y >= 0.")
def mean_squared_log_error_np(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
"""Mean squared logarithmic error (MSLE).
MSLE(y, y_hat) = mean((log1p(y_hat) - log1p(y))^2)
"""
y_true_2d = _as_2d(y_true)
y_pred_2d = _as_2d(y_pred)
if y_true_2d.shape != y_pred_2d.shape:
raise ValueError(f"shape mismatch: y_true{y_true_2d.shape} vs y_pred{y_pred_2d.shape}")
_check_non_negative(y_true_2d, name="y_true")
_check_non_negative(y_pred_2d, name="y_pred")
t_true = np.log1p(y_true_2d)
t_pred = np.log1p(y_pred_2d)
residual = t_pred - t_true
if sample_weight is None:
msle_per_output = np.mean(residual**2, axis=0)
else:
w = np.asarray(sample_weight, dtype=float)
if w.ndim != 1:
raise ValueError("sample_weight must be 1D of shape (n_samples,).")
if w.shape[0] != y_true_2d.shape[0]:
raise ValueError("sample_weight length must match n_samples.")
w = w.reshape(-1, 1)
msle_per_output = np.sum(w * residual**2, axis=0) / np.sum(w, axis=0)
if multioutput == "raw_values":
return msle_per_output
if multioutput == "uniform_average":
return float(np.mean(msle_per_output))
weights = np.asarray(multioutput, dtype=float)
if weights.shape != (msle_per_output.shape[0],):
raise ValueError("multioutput weights must match n_outputs.")
return float(np.average(msle_per_output, weights=weights))
def root_mean_squared_log_error_np(
y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"
):
"""Root mean squared logarithmic error (RMSLE): sqrt(MSLE)."""
msle_per_output = mean_squared_log_error_np(
y_true,
y_pred,
sample_weight=sample_weight,
multioutput="raw_values",
)
rmsle_per_output = np.sqrt(msle_per_output)
if multioutput == "raw_values":
return rmsle_per_output
if multioutput == "uniform_average":
return float(np.mean(rmsle_per_output))
weights = np.asarray(multioutput, dtype=float)
if weights.shape != (rmsle_per_output.shape[0],):
raise ValueError("multioutput weights must match n_outputs.")
return float(np.average(rmsle_per_output, weights=weights))
y_true_rand = rng.lognormal(mean=1.2, sigma=0.9, size=(60, 3))
y_pred_rand = y_true_rand * rng.lognormal(mean=0.0, sigma=0.3, size=y_true_rand.shape)
print("ours raw:", root_mean_squared_log_error_np(y_true_rand, y_pred_rand, multioutput="raw_values"))
print("sk raw:", root_mean_squared_log_error(y_true_rand, y_pred_rand, multioutput="raw_values"))
sample_w = rng.uniform(0.5, 2.0, size=y_true_rand.shape[0])
print("ours weighted:", root_mean_squared_log_error_np(y_true_rand, y_pred_rand, sample_weight=sample_w))
print("sk weighted:", root_mean_squared_log_error(y_true_rand, y_pred_rand, sample_weight=sample_w))
assert np.allclose(
root_mean_squared_log_error_np(y_true_rand, y_pred_rand, multioutput="raw_values"),
root_mean_squared_log_error(y_true_rand, y_pred_rand, multioutput="raw_values"),
)
assert np.isclose(
root_mean_squared_log_error_np(y_true_rand, y_pred_rand, sample_weight=sample_w),
root_mean_squared_log_error(y_true_rand, y_pred_rand, sample_weight=sample_w),
)
# Negative values should raise (to match sklearn)
try:
root_mean_squared_log_error_np([0.0, 1.0], [0.0, -0.1])
except ValueError as e:
print("caught:", e)
ours raw: [0.2174 0.2333 0.2297]
sk raw: [0.2174 0.2333 0.2297]
ours weighted: 0.23046738688656432
sk weighted: 0.23046738688656432
caught: y_pred contains negative values; RMSLE/MSLE require y >= 0.
7) RMSLE as an objective: gradients and optimization#
Because the square root is monotonic, minimizing RMSLE is equivalent to minimizing MSLE.
Let \(\Delta_i = \log(1+\hat y_i) - \log(1+y_i)\). Then:
\[\mathrm{MSLE} = \frac{1}{n} \sum_{i=1}^{n} \Delta_i^2.\]
Derivative w.r.t. a prediction \(\hat y_i\) (for \(\hat y_i > -1\)):
\[\frac{\partial\, \mathrm{MSLE}}{\partial \hat y_i} = \frac{2}{n} \cdot \frac{\Delta_i}{1 + \hat y_i}.\]
For RMSLE:
\[\frac{\partial\, \mathrm{RMSLE}}{\partial \hat y_i} = \frac{1}{2\, \mathrm{RMSLE}} \cdot \frac{\partial\, \mathrm{MSLE}}{\partial \hat y_i} = \frac{\Delta_i}{n\, \mathrm{RMSLE}\, (1 + \hat y_i)}.\]
Practical takeaway:
There is an extra factor \(\frac{1}{1+\hat y_i}\), so gradients are larger for small predictions.
A very common training trick is to optimize in log space: fit a model to \(t = \log(1+y)\) using standard squared error, then transform back with `expm1`.
# Synthetic data with multiplicative noise (log-normal in y)
n = 400
x = rng.uniform(0.0, 6.0, size=n)
# True relationship in log1p-space
t = 1.5 + 1.0 * x + rng.normal(0.0, 0.35, size=n) # t = log1p(y)
y = np.expm1(t)
# Train/test split
perm = rng.permutation(n)
cut = int(0.8 * n)
tr, te = perm[:cut], perm[cut:]
x_tr, y_tr = x[tr], y[tr]
x_te, y_te = x[te], y[te]
fig = px.scatter(
x=x_tr,
y=y_tr,
opacity=0.7,
title="Synthetic regression data (y spans a wide range)",
labels={"x": "feature x", "y": "target y"},
)
fig.update_yaxes(type="log")
fig.show()
def predict_linear(x, w, b):
x = np.asarray(x, dtype=float)
return w * x + b
def fit_linear_mse_gd(x, y, *, lr=5e-4, steps=600):
"""Fit y ≈ w x + b by minimizing MSE on y (gradient descent)."""
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
w = 0.0
b = 0.0
n = x.shape[0]
hist = {"mse": [], "rmsle": [], "w": [], "b": []}
for _ in range(steps):
y_hat = predict_linear(x, w, b)
r = y_hat - y
mse = float(np.mean(r**2))
# RMSLE isn't defined for negative predictions in sklearn; clip for evaluation.
y_hat_clip = np.maximum(y_hat, 0.0)
rmsle = float(root_mean_squared_log_error_np(y, y_hat_clip))
grad_w = (2.0 / n) * float(np.dot(r, x))
grad_b = (2.0 / n) * float(np.sum(r))
w -= lr * grad_w
b -= lr * grad_b
hist["mse"].append(mse)
hist["rmsle"].append(rmsle)
hist["w"].append(w)
hist["b"].append(b)
return w, b, hist
def fit_log1p_mse_gd(x, y, *, lr=0.05, steps=600):
"""Fit log1p(y) ≈ w x + b (equivalent to optimizing MSLE/RMSLE)."""
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
t = np.log1p(y)
w = 0.0
b = 0.0
n = x.shape[0]
hist = {"mse_log": [], "mse_y": [], "rmsle": [], "w": [], "b": []}
for _ in range(steps):
t_hat = predict_linear(x, w, b) # model predicts log1p(y)
r = t_hat - t
mse_log = float(np.mean(r**2))
y_hat = np.expm1(t_hat)
y_hat = np.maximum(y_hat, 0.0)
mse_y = float(np.mean((y_hat - y) ** 2))
rmsle = float(root_mean_squared_log_error_np(y, y_hat))
grad_w = (2.0 / n) * float(np.dot(r, x))
grad_b = (2.0 / n) * float(np.sum(r))
w -= lr * grad_w
b -= lr * grad_b
hist["mse_log"].append(mse_log)
hist["mse_y"].append(mse_y)
hist["rmsle"].append(rmsle)
hist["w"].append(w)
hist["b"].append(b)
return w, b, hist
w_y, b_y, hist_y = fit_linear_mse_gd(x_tr, y_tr)
w_t, b_t, hist_t = fit_log1p_mse_gd(x_tr, y_tr)
y_hat_te_mse = np.maximum(predict_linear(x_te, w_y, b_y), 0.0)
y_hat_te_log = np.maximum(np.expm1(predict_linear(x_te, w_t, b_t)), 0.0)
print("Test RMSLE (fit MSE on y): ", root_mean_squared_log_error_np(y_te, y_hat_te_mse))
print("Test RMSLE (fit on log1p(y)):", root_mean_squared_log_error_np(y_te, y_hat_te_log))
print("Test RMSE (fit MSE on y): ", root_mean_squared_error(y_te, y_hat_te_mse))
print("Test RMSE (fit on log1p(y)):", root_mean_squared_error(y_te, y_hat_te_log))
Test RMSLE (fit MSE on y): 1.4927333701234642
Test RMSLE (fit on log1p(y)): 0.34749067593865945
Test RMSE (fit MSE on y): 358.24267736906455
Test RMSE (fit on log1p(y)): 155.56476300415895
df_hist = pd.DataFrame(
{
"step": np.arange(len(hist_y["rmsle"])),
"RMSLE (fit MSE on y)": hist_y["rmsle"],
"RMSLE (fit on log1p(y))": hist_t["rmsle"],
}
)
df_hist_long = df_hist.melt(id_vars="step", var_name="model", value_name="rmsle")
fig = px.line(
df_hist_long,
x="step",
y="rmsle",
color="model",
title="Training curves (RMSLE evaluated on the train set)",
labels={"rmsle": "RMSLE"},
)
fig.show()
df_pred = pd.DataFrame(
{
"y_true": np.concatenate([y_te, y_te]),
"y_pred": np.concatenate([y_hat_te_mse, y_hat_te_log]),
"model": np.repeat(
["fit MSE on y (linear)", "fit on log1p(y)"],
repeats=len(y_te),
),
}
)
eps = 1e-6
min_v = float(np.minimum(df_pred["y_true"].min(), df_pred["y_pred"].min()))
max_v = float(np.maximum(df_pred["y_true"].max(), df_pred["y_pred"].max()))
min_v = max(min_v, eps)
fig = px.scatter(
df_pred,
x="y_true",
y="y_pred",
color="model",
opacity=0.7,
title="Test predictions: y_true vs y_pred",
labels={"y_true": "true y", "y_pred": "predicted y"},
)
fig.add_trace(
go.Scatter(
x=[min_v, max_v],
y=[min_v, max_v],
mode="lines",
name="y = x",
line=dict(color="black", dash="dash"),
)
)
fig.update_xaxes(type="log")
fig.update_yaxes(type="log")
fig.show()
8) Practical usage notes (scikit-learn)#
If you want to optimize for RMSLE, a common baseline is:
- transform targets with `log1p`
- fit a standard regression model
- invert predictions with `expm1`
To avoid invalid values, clip predictions to \(\hat y \ge 0\) before computing RMSLE.
Scikit-learn provides TransformedTargetRegressor to make the log/exp transform explicit.
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
X_tr = x_tr.reshape(-1, 1)
X_te = x_te.reshape(-1, 1)
model = TransformedTargetRegressor(
regressor=LinearRegression(),
func=np.log1p,
inverse_func=np.expm1,
)
model.fit(X_tr, y_tr)
y_pred_te = model.predict(X_te)
y_pred_te = np.clip(y_pred_te, 0.0, None)
print("sklearn RMSLE:", root_mean_squared_log_error(y_te, y_pred_te))
sklearn RMSLE: 0.3474906789145831
9) Pros, cons, and when to use RMSLE#
Pros
Focuses on multiplicative errors: being off by a factor matters more than being off by a constant
Handles targets spanning orders of magnitude (less dominated by large absolute values)
Natural when noise is approximately log-normal / heteroscedastic (variance grows with the mean)
Easy to optimize by modeling \(\log(1+y)\) and using squared error there
Cons
Requires non-negative targets and predictions (not suitable when \(y\) can be negative)
Can overweight small targets: mistakes near zero matter a lot
Reported value is in log units (less directly interpretable than RMSE/MAE)
If you train in log space and then invert with `expm1`, predictions correspond more to a median than a mean in the original space (bias can appear)
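The median-vs-mean point can be illustrated with simulated log-normal noise (a sketch; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 0.8
t = rng.normal(mu, sigma, size=200_000)  # t = log1p(y) is Gaussian
y = np.expm1(t)

# Fitting squared error in log space recovers E[t]; inverting with expm1
# gives roughly the *median* of y, which undershoots the *mean* of y
# because the back-transform of a symmetric distribution is right-skewed.
pred = np.expm1(t.mean())
```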
Good default when
Targets are counts/prices/sales/traffic/demand and you care about relative error
Targets have a heavy right tail and you want evaluation that doesn’t get dominated by the largest cases
10) Common pitfalls and diagnostics#
Invalid negatives: RMSLE is not defined for negative values in most libraries; enforce \(\hat y \ge 0\) (model choice or clipping).
Zero-heavy targets: inspect performance separately on \(y=0\) vs \(y>0\); RMSLE can behave differently near zero.
Compare metrics: always compare RMSLE with RMSE/MAE; choose based on the cost of absolute vs relative errors.
Inspect residuals in log space: if you optimize for RMSLE, plot \(\log(1+\hat y) - \log(1+y)\), not only \(\hat y - y\).
Remember the +1: the “relative error” intuition is best when targets are not tiny.
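The log-space residual diagnostic above can be sketched as follows (synthetic data with illustrative parameters; the ~10% multiplicative over-prediction is made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.lognormal(mean=2.0, sigma=1.0, size=5_000)
# Simulate a model that systematically over-predicts by ~10% (multiplicatively)
y_pred = y_true * rng.lognormal(mean=0.1, sigma=0.3, size=y_true.shape)

log_resid = np.log1p(y_pred) - np.log1p(y_true)  # what RMSLE actually penalizes
raw_resid = y_pred - y_true                      # raw-scale residuals for comparison

bias_in_log_space = float(log_resid.mean())  # positive -> systematic over-prediction
```

The multiplicative bias shows up as a clear positive mean in `log_resid`, whereas the raw residuals are dominated by the largest targets and hide the pattern.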
Exercises#
- Add support for `sample_weight` and explicit `multioutput` weights to the plotting examples (do some outputs matter more?).
- Create a dataset where the true noise is additive (not multiplicative) and compare RMSE vs RMSLE behavior.
- Show that for large \(y\), MSLE is approximately the squared log-ratio: \((\log(\hat y / y))^2\).
References#
scikit-learn metrics API: https://scikit-learn.org/stable/api/sklearn.metrics.html
Kaggle discussions on RMSLE (common for count/price targets)